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Abstract: Ordering of regression or classification coefficients occurs in many real-world applications. 
Fused Lasso exploits this ordering by explicitly regularizing the differences between neighboring co- 
efficients through an i\ norm regularizer. However, due to nonscparability and nonsmoothness of the 
regularization term, solving the fused Lasso problem is computationally demanding. Existing solvers 
C " 3 i can only deal with problems of small or medium size, or a special case of the fused Lasso problem in 

£Nj , which the predictor matrix is identity matrix. In this paper, we propose an iterative algorithm based 

on split Bregman method to solve a class of large-scale fused Lasso problems, including a generalized 
fused Lasso and a fused Lasso support vector classifier. We derive our algorithm using augmented 
Lagrangian method and prove its convergence properties. The performance of our method is tested 
on both artificial data and real-world applications including protcomic data from mass spectrometry 
VP , and genomic data from array CGH. We demonstrate that our method is many times faster than the 

■ existing solvers, and show that it is especially efficient for large p, small n problems. 
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1. Introduction 



Regularization terms that encourage sparsity in coefficients arc increasingly being used in regression and clas- 
sification procedures. One widely used example is the Lasso procedure for linear regression, which minimizes 
the usual sum of squared errors, but additionally penalizes the l\ norm of the regression coefficients. Because 
of the non-differentiability of the l\ norm, the Lasso procedure tends to shrink the regression coefficients 
' toward zero and achieves sparseness. Fast and efficient algorithms are available to solve Lasso with as many 

QO , as millions of variables, which makes it an attractive choice for many large-scale real-world applications. 

The fused Lasso method introduced by Tibshirani et al. [1] is an extension of Lasso, and considers the 
situation where there is certain natural ordering in regression coefficients. Fused Lasso takes this natural 
• ordering into account by placing an additional regularization term on the differences of "neighboring" coef- 

ficients. Consider the linear regression of {(xj, yi)}™^, where Xj = (xn, . . . , Xi P ) T are the predictor variables 
and yi are the responses. (We assume Xij,yi arc standardized with zero mean and unit variance across dif- 
ferent observations.) Fused Lasso finds the coefficients of linear regression by minimizing the following loss 
function 



<f> (/ ? ) = ^E U-$>«& +Al£>i|+A 2 |, (f) 

t=l V j=l J i=l i=2 

where the regularization term with parameter Ai encourages the sparsity of the regression coefficients, while 
the regularization term with parameter A2 shrinks the differences between neighboring coefficients toward 
zero. As such, the method achieves both sparseness and smoothness in the regression coefficients. 

Regression or classification variables with some inherent ordering occur naturally in many real-world 
applications. In genomics, chromosomal features such as copy number variations (CNV), epigenetic modifi- 
cation patterns, and genes are ordered naturally by their chromosomal locations. In proteomics, molecular 
fragments from mass spectrometry (MS) measurements are ordered by their mass-to-charge ratios (m/z). 
In dynamic gene network inference, gene regulatory networks from developmcntally closer cell types are 
more similar than those from more distant cell types [2]. Fused Lasso exploits these natural ordering, so, 
not surprisingly, it has found applications in these areas particularly suitable. For example, Tibshirani and 
Wang successfully applied fused Lasso to detect DNA copy number variations in tumor samples using array 
comparative genomic hybridization (CGH) data [3]. Tibshirani ct al. used fused Lasso to select protcomic 
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features that can separate tumor vs normal samples [1]. In addition to the application areas mentioned 
above, fused Lasso or the extension of it has also found applications in a number of other areas, including 
image dcnoising, social networks [2], quantitative trait network analysis [4], etc. 

The loss function in (1) is strictly convex, so a global optimal solution is guaranteed to exist. However, 
finding the optimal solution is computationally challenging due to the nondiffcrcntiability of $(/?). Existing 
methods circumvent the nondiffcrcntiability of $(/3) by introducing 2p— 1 additional variables and converting 
the unconstrained optimization problem into a constrained one with 6p — 1 linear inequality constraints. 
Standard convex optimization tools such as SQOPT [5] and CVX [6] can then be applied. Because of the 
large number of variables introduced, these methods are computationally demanding in terms of both time 
and space, and, in practice, have only been able to solve fused Lasso problems with small or medium sizes. 

Component-wise coordinate descent has been proposed as an efficient approach for solving many l\ regu- 
larized convex optimization problems, including Lasso, grouped Lasso, elastic nets, graphical Lasso, logistic 
regression, etc [7]. However, coordinate descent cannot be applied to the fused Lasso problem because the 
variables in the loss function $(/3) are nonseparablc due to the second regularization term, and as such, 
convergence is not guaranteed [8] . 

For a special class of fused Lasso problems, named fused Lasso signal approximator (FLSA), where the 
predictor variables Xij = 1 for all i = j and otherwise, there are algorithms available to solve it efficiently. 
A key observation of FLSA first noticed by Friedman et al. is that for fixed Ai, increasing A2 can only 
cause pairs of variables to fuse and they become unfused for any larger values of A2 . This observation allows 
Friedman et al. to develop a fusion algorithm to solve FLSA for a path of A2 values by keeping track of 
fused variables and using coordinate descent for component-wise optimization. The fusion algorithm was 
later extended and generalized by Hocfling [9]. However, for the fusion algorithm to work, the solution path 
as a function of A2 has to be piecewise linear , which is not true for the general fused Lasso problem [10] . As 
such, these algorithms are not applicable to the general fused Lasso case. 

In this paper, we propose a new method based on the split Bregman iteration for solving the general 
fused Lasso problem. Although the Bregman iteration was an old technique proposed in the sixties [11, 12], 
it gained significant interest only recently after Oshcr and his coauthors demonstrated its high efficiency for 
image restoration [13, 14, 15]. Most recently, it has also been shown to be an efficient tool for compressed 
sensing [16, 17, 18], matrix completion [19] and low rank matrix recovery [20]. In the following, we will show 
that the general fused Lasso problem can be reformulated so that split Bregman iteration can be readily 
applied. 

The rest of the paper is organized as follows. In Section 2, we derive algorithms for a class of fused Lasso 
problems from augmented Lagrangian function including SBFLasso for general fused Lasso, SBFLSA for 
FLSA and SBFLSVM for fused Lasso support vector classifier. The convergence properties of our algorithms 
are also presented. We demonstrate the performance and effectiveness of the algorithm through numerical 
examples in Section 3, and describe additional implementation details. Algorithms described in this paper 
are implemented in Mat lab and are freely available from the authors. 

2. Algorithms 

2. 1 . Split Bregman iteration for a generalized fused Lasso problem 

Wc first describe our algorithm in a more general setting than the one described in (1). Instead of the 
quadratic error function, we allow the error function to be any convex function of the regression coefficients. 
In addition, we relax the assumption that the coefficients should be ordered along a line as in (1), and allow 
the ordering to be specified arbitrarily, e.g., according to a graph. For the generalized fused Lasso, we find 
/3 by solving the following unconstrained optimization problem 

mmV(0) + \ 1 \\P\\ 1 + \ 2 \\Lp\\ 1 , (2) 

where V(/3) = V(j3; X, y) is the error term, the regularization term with parameter Ai encourages the sparsity 
of ft as before, and the regularization term with parameter A2 shrinks the differences between neighboring 
variables as specified in matrix L toward zero. We assume L is an m x p matrix. In the standard fused Lasso 
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in (1), L is simply a (p — 1) xp matrix with zeros entries everywhere except 1 in the diagonal and —1 in the 
superdiagonal. The unconstrained problem (2) can be reformulated into an equivalent constrained problem 

min F( ) 8) + A 1 ||a|| 1 + A 2 ||6|| 1 

s.t. a = ft 

b = Lft. (3) 

Although split Brcgman methods originated from Bregman iterations [16, 13, 18, 21, 14, 15], it is more 
convenient to derive the split Brcgman method for the generalized Lasso using the augmented Lagrangian 
method [22, 23]. 

Note that the Lagrangian function of (3) is 

C(fi, a, b, u, v) = V(ft) + Ai ||o||i + A 2 ||&||i + («, ft - a) + {v, Lft-b), (4) 

where u £ W is a dual variable corresponding to the linear constraint ft = a, v G K m is a dual variable 
corresponding to the linear constraint Lft = 6, (■, ■) denotes the standard inner product in Euclidean space. 
The augmented Lagrangian function of (3) is similar to the (4) except for adding two terms — a\\ 2 + 

%||I//3 — b\\\ to penalize the violation of linear constraints ft — a and Lft = b. That is, 

C(ft,a,b,u,v) = V(ft) + AxlloHx + A 2 ||6||i + (u,ft- a) + (v,Lf3 -b) + ^\\ft- a\\\ + ^||L/3 - b\\%, (5) 

where /ii > and /i 2 > are two parameters. 

Consider the problem of finding a saddle point (/?*, a*, b*, u*, v*) of the augmented Lagrangian function 
>C(/3, a, b, u, v) such that 

L(fi* ,a* ,b* ,u,v) < C(f3* ,a* ,b* ,u* ,v*) < C{P,a,b,u* ,v*) (6) 

for all j3, a, b, u and v. It can be shown that (3* is an optimal solution of (2) if and only if {fi* , a*, b* , u* , v*) 
solves the above saddle point problem for some a*, b* , u*, and v*. 

We solve the saddle point problem through an iterative algorithm by alternating between the primal and 
the dual optimization as follows 

Primal: (f3 k+1 , a k+1 , b k+1 ) — argmin i g iai6 £(/3, a, 6, u k , v k ) 
Dual : u k+1 = u k + 5i{p k+1 - a k+1 ), v k+1 = v k + 8 2 (Lf3 k+1 - b k+1 ) 

where the first step updates the primal variables based on the current estimate of u k and v k , while the second 
step updates the dual variables based on the current estimate of the primal variables. Since the augmented 
Lagrangian function is linear in u and v, updating the dual variables are relatively easy and we use gradient 
ascent algorithm with step size 5i and 82- 

The efficiency of the iterative algorithm (7) lies on whether the primal problem can be solved quickly. 
The augmented Lagrangian function £ still contains nondiffcrcntiable terms. But different from the original 
objective function in (2), the l\ induced nondiffcrcntiability has now been transferred from terms involving ft 
to terms involving a and b only. Moreover, the nondifferentiable terms involving a and b are now completely 
decoupled, and thus we can solve the primal problem by alternating minimization of ft, a and b, 

'pk+i = argmin/3 V (ft) + (u k ,ft- a k ) + (v k ,Lft - b k ) + %t\\ft- a k \\ 2 2 + f\\Lft - b k \\ 2 2 
< a k+1 =argiRm a X 1 \\a\\ 1 + (u k ,ft k+1 -a) + if\\ft k+1 -a\\ 2 2 (8) 
b k+i = argm i nfe a 2 \\b\l! + (v k ,Lft k+1 -b) + if\\Lft k+1 -b\\l 

Minimization of a and b in (8) can be done efficiently using soft thresholding, because the objective 
functions are quadratic and nondiffcrcntiable terms are completely separable. Let 7a be a soft thresholding 
operator defined on vector space and satisfying 

T\{w) = [t\(wi),t\(w 2 ), ...... .] T , with t\(wi) = sgn(wi) max{0, \wi\ - A}. (9) 
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Using the soft thresholding operator, the optimal solution of a and b in (8) can be written as 



a k+1 ^T^ Xi ((3 k+1 +^ 1 u k ) and b k+1 = , 2 (L/? fe+1 + ^\ k ). (10) 

Therefore, the efficiency of the iterative algorithm depends entirely on whether the minimization of j3 in (8) 
can be done efficiently. If V((3) is a quadratic function as in the standard fused Lasso, the optimal solution 
f3 k+1 can be found analytically. 

In theory, the alternate minimization between the primal variables needs to run multiple times until 
convergence. However, we do not have to completely solve the primal problem since it is only one step of 
the overall iterative algorithm. Our algorithm uses only one alternation. Overall, we propose Algorithm 1 
for solving the saddle point problem (6), and consequently the problem (2). 

Algorithm 1: Split Bregman method for the generalized Fused Lasso (2) 

Initialize /9°, a°,b°,u a , and v°. 
repeat 

1) /3 k+1 = argmin/3 V(/3) + {u k , /3 - a k ) + {v k , Lf3 - b k ) + ^||/3- a k \\% + f-\\L0 - fe fe ||| 

2) a k+1 =T-i Ai (/3 fc+1 +^ 1 u k ) 

3 ) fc fc+1 =T 2 - lA2 (L/3 fc + 1 +M 2 ~V) 

4) u k+1 = u k + <5i(/3 fc+1 - a k+1 ) 

5) v k + l = v k + 8 2 (Ll3 k + 1 - b k + 1 ) 
until 

Convergence 

The convergence property of Algorithm 1 is shown in the following theorem, which we prove in the 
Supplementary Info. 

Theorem 1. Suppose there exists at least one solution /3* of (2). Assume V(f3) is convex, < 8 < fix, 

< 82 < fJ.2, and Ai > 0, A2 > 0. Then the following property for the split Bregman iteration in Algorithm 

1 holds: 

lim U(/3) + A 1 ||/3 fe || 1 + A 2 ||L/3 fe || 1 + Ai||/3*||i + A 2 ||L/3*|| 1 . (11) 

k— >oo 

Furthermore, 

lim -[3*\\ = (12) 

k— >oo 

whenever (2) has a unique solution. 

Note that the condition for the convergence in Theorem 1 is quite easy to satisfy. Ai, A2 are regularization 
parameters and should always be larger than zero. So as long as < Si < \i\ and < 82 < fJ-2, the algorithm 
converges. In our implementation, we just choose Si = [ii and 82—^2- 

2.2. Split Bregman for the standard fused Lasso (SBFLasso) 

Next we apply Algorithm 1 to solve the standard fused Lasso problem (1), which constitutes a special case 
of the generalized fused Lasso problem with 

V{frX,y)= l -\\Xf3-y\\l 

and 

LP = 08a - A, p 3 AVif, (13) 

where X = (x tJ )"f lj=1 and y = (yi, y n ) T . 

The objective function on minimizing j3 in Algorithm 1 is now quadratic and diffcrcntiablc, and thus the 
optimal solution can be found by solving a set of linear equations: 

(X T X + fnl + f i 2 L T L)p k+1 = X T y + m(a k - ^ 1 u k ) + f i 2 L T (b k - m^V), (14) 
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while the other four steps in Algorithm 1 are easy to implement and can be computed quickly. So the 
efficiency of the algorithm largely depends on how fast the linear equations can be solved. Matrix D = 
X T X + + u 2 L J L is a p x p matrix, independent of the optimization variables. For small p, we can invert 
D and store D~ x in the memory, so the linear equations can be solved with minimal cost. However, for large 
p, we will need to numerically solve the linear equations at each iteration. 

The matrix P = n\I + /j,2L t L occurring in (14) is a tridiagonal positive definite matrix, and as such, 
the linear equation (pL\I + p,2L T L)x = g can be solved efficiently for any x, g G MP, requiring only order of 
p operations. More specifically, there exists a matrix L satisfying L,j = for all i ^ j and i ^ j - 1 such 
that (fii + 1)1 + pL2L T L = LL T . This decomposition can be achieved by using Cholesky factorization. Thus 
solving equation (15) is equivalent to solving two systems of linear equations Lz = X T y + ni(a k — fa u k ) + 
H2L T (b k — l^2 lyk ) and L T f3 k+1 = z. These two equations can be easily solved due to the special structure 
of L. 

The linear system (2.3) is very special for large p, small n problems in that X T X will be a low rank 
matrix with rank at most n. In combination of the special structure of matrix P mentioned above, we use 
preconditioned conjugate gradient algorithm (PCG) to solve (2.3). 

The PCG algorithm [24] computes an approximate solution of the linear equations i/x = g using a 
preconditioner P, where both H,P £ M. pxp are symmetric positive definite. For the linear equation (14), we 
use preconditioner P = U\I + /i2^ T L and the PCG algorithm converges in less than n steps. In our numerical 
implementation, we found that PCG converges in a few steps much smaller than n. 

2.3. Split Bregman for FLSA (SBFLSA) 

The fused Lasso signal approximator (FLSA) problem introduced by Friedman et al. corresponds to a special 
case of the standard fused Lasso with the predictor variables X being an identify matrix. Therefore, for FLSA, 
the primal problem minimizing /3 is simply 

(( Ml + 1)1 + n 2 L T L)p k+1 = y + m{a k - fi^u k ) + p 2 L T {b k - M ^ V), (15) 

while other iterations stay the same. 

As mentioned in the preceding section, matrix (/xj + 1)/ + [i2L T L is a tridiagonal positive definite matrix, 
so (15) can be solved very efficiently, requiring only order of p operations. Therefore, we have a fast solver 
for SBFLSA using the split Bregman iteration in Algorithm 1. 

2-4- Iterative algorithm for fused Lasso Support Vector Classifier (SBFLSVM) 

Next we derive a split Bregman algorithm for the fused Lasso support vector classifier (FLSVM) introduced 
by [1]. FLSVM uses a hinge loss function [25] for two-label classification problems. It finds the optimal 
classification coefficients (/3o,/3) that minimize 

1 - 

/(A), P) = - YYl - y i ( y S T x i + /3 ))+ + + \2\im\r, (16) 

n * — ' 

where = maxjit, 0} for any «el and L is the difference operator defined in (13). 

Because the hinge loss function (1 — 1) + is not diffcrcntiable, the primal problem involving f3 is now more 
difficult to solve. As a result, we will not directly apply Algorithm 1 to solve FLSVM. Instead, we introduce 
additional variables to deal with the nondifferentiability of the hinge loss. 

Let Y be a diagonal matrix with its diagonal elements to be the vector y. The unconstrained problem in 
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(16) can be reformulated into an equivalent constrained optimization problem 

1 - 

min - V7cj)+ + Xi\\a\\i + A 2 ||6||i 

Rr. n h r- 71 ' ' 



/3,l3 ,a,b,c Tl 

i—1 

s. t. P = a 

Lf3 = b 

1 - YX/3 - p y = c, 



where 1 is an n-column vector of Is. 

The augmented Lagrangian function of (17) is 



1 - 

C(/3,/3 ,a,b,c,u,v,w) = - V(q)+ + X 1 ||a||i + A 2 ||6||i + (u,p-a) + (v,Lp-b)+ 

71 ' 



(17) 



(w, l-YXp-p -c) + ^\\p - ag + ^\\LP - b\\ 2 2 + ^||1 - YX/3 - p - c|| 2 , 

where u, v, w are dual variables corresponding to linear constraints ft = a, LP = b, and 1 — YXP — Pay = c 
respectively. Positive reals \x\, [12 and ^3 are the penalty parameters for the violation of the linear constraints. 

Similar to the derivation of Algorithm 1, we find the saddle point of £ by iteratively updating the primal 
and dual directions 

f (P k+1 , /3 fc+1 ) = argmiii AA > fe ,/3 - a k ) + (v k ,LP - b k ) + (w k ,l- YXP - p Q y - c k ) 
+W - + f W - b k \\l + fill - YXP - p y - c k \\l 
a k+1 = argmin Q X^ + (u k ,P k+1 - a) + ^||/3 fc+1 - a|||, 
b k+1 = argmin b A a ||6||i + (v k ,Lp k+1 - b) + f\\Lp k+1 - 



argmin c ± Eti(*)+ + (w k , 1 - YXp k ^ - /3 fe+1 j/ - c) 
YXP^-p k Q +1 y-cf 2 



(18) 



i fc+1 
,fe+i 



-f||l 
fc+i _ 



v k + S 2 (Lp k+1 -b k+1 ), 



ir 



k+l 



w 



S 3 (l~YXp k 



fe+i 



P k +1 y 



„k+l 



The update for a k+1 , b k+1 , u k+1 , v k+1 , w k+1 are almost the same as the one in Algorithm 1, so we focus on 
the update for {P k+1 , Pq +1 ) and c k+1 . In Supplementary Information we show that PCG can still be applied 
to solve the updating of (/3 fe+1 , Pq +1 ) with some modifications. 

To update of c fe+1 in 18, we use the following proposition, which is proven in the Supplementary Info. 



Proposition 1. Let s\(w) — argmin^gR Xx + + h\\x — w\\\. Then 

{w — X, w > X, 
0, < w < A, 

w, w < 0. 

With Proposition 1, we can then update c k+1 in 18 according to 

Corollary 1. c k+1 = S^_(l — YXp k+1 — Pn +1 y + fJ, 3 1 w k ) is the solution of equation (??), where 

71,23 

S x (w) = (sx{wi), sx{w 2 ), • • • , s\(w n )), Vw e K" 

with s\ defined by (19). 

Proof. The equation (??) is equivalent to 



(19) 



argmin J- + h\l- YXp k+1 - p k+1 y - c + fi^g. 



(20) 



Note that each element of c is independent of each other in (20), we can get the desired result by using 
Proposition 1. □ 

In summary, we derive Algorithm 2 to solve (16). 

Algorithm 2: Split Bregman method for FLSVM 

Initialize /3°, /3g, a , 6°, c°, u°, v° , and w°. 
repeat 

1) Update /3 fe + 1 , /Sg" 1 " 1 by solving the linear equations: 
(Hil + ^L T L + n 3 X T Y 2 X n 3 X T Yy\ ( 7? fc+1 \ 

V ^y T YX u 3 y T y ) {/3 k + 1 J 

= Ml ~ f V ) + M2 ( L Q T ) (b k - ^ V) + u 3 (1 ~ C fc + 

2) a^ 1 =T-i Ai (/3 fc+1 +A t ~ 1 « fc ) 

3 ) fe fc+1 =r^ lA2 (L/3 fc + 1 + ^ 1 ^) 

4) c k+1 = S i (1 - yX/3 fc+1 - ^o +1 3/ + H 1 w k ) 

5) u fe+1 = u k + S 1 (L/3 k+1 - a k+l ) 

6) i; fe + 1 = i) fe + 5 2 ( / 3' : + 1 - v k+1 ) 

7) io fc + 1 = w k + <5 3 (1 - yX/3 fc + 1 - (3 k+1 y - c k + l ) 
until 

Convergence 

The convergence property of Algorithm 2 is shown in the following theorem, which we prove in the 
Supplementary Info. 

Theorem 2. Suppose there exists at least one solution (3* of (16). Assume < S\ < /^i,0 < 82 < ^2 
Ai > 0, A2 > 0. TTien £/ie following property for Algorithm 2 holds: 

y l (xf/3 fe +^)) + + A 1 ||/3 fe || 1 + A 2 ||L/3 fc || 1 

2/ 4 (xf/T + /?*))+ + Ar||r Id + A 2 ||L0*||i. (21) 
hm ||/3 fe -/3*||=0 (22) 

fc— >00 

whenever (16) /ias a unique solution. 
3. Experimental Results 

Next we illustrate the efficiency of split Bregman method for fused Lasso using time trials on artificial data 
as well as real-world applications from genomics and proteomics. All our algorithms were implemented in 
Matlab, and compiled on a windows platform. Time trials were generated on an Intel Core 2 Duo desktop 
PC (E7500, 2.93GHz). 

As the regression form of the fused Lasso procedures is more frequently used, we will thus focus on testing 
the performance of SBFLasso and SBFLSA. To evaluate the performance of SBFLasso, we compare it with 
SQOPT and CVX. SQOPT [5] is used in the original fused Lasso paper by Tibshirani et al. [1]. It is a 
two-phase active set algorithm, designed for quadratic programming problems with sparse linear constraints. 
CVX is a general convex optimization package [6]. SQOPT and CVX solve the fused Lasso by introducing 
additional variables and constraints to transform the nondiffcrentiable objective function into a smooth one. 
Both solvers are implemented in Matlab, and thus are directly comparable to our implementation. SQOPT 
allows warm start, so we will use it whenever possible. To evaluate the performance of SBFLSA for solving 
FLSA, we mainly compare it with the path algorithm proposed by Hocfling [9]. 



1 

lim 

fc-s-oo n — ' 

= i E( 1 - 



i=l 
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The stopping criterion of SBFLasso is specified as follows. Let <^(P k ) ^ ^\\X /3 k ~y\\l+X 1 \\(3 k \\ 1 +X 2 \\Ll3 k \\ 1 . 
According to Theorem 1, lim^oo <J>(/3 fc ) = $(/?*). Therefore, we terminate SBFLasso when the relative 
change of the energy functional i ||^/3 — y\\\ + Ai||/3|ji + A 2 ||L/3||i falls below certain threshold 5. We used 
S = 10~ 5 in our simulation, i.e., we stop the Bregman iteration whenever 

Note that the convergence of Algorithm 1 is guaranteed no matter what values of fii and fj,2 are used as 
shown in Theorem 1. The speed of the algorithm can, however, be influenced by the choices of fj,i and /i2 
as it would affect the number of iterations involved. In our implementation, we choose the parameter values 
using a pretrial procedure in which we test the convergence rate for a set of parameter values and identify 
the one that gives rise to the highest convergence rate. For regression problems, we always set /ii = p 2 and 
select values from the set {0.2,0.4,0.6,0.8,1} x ||y||2- The parameter selecting procedure can certainly be 
further improved, but empirically we find it works well for all the problems we tested. 



3.1. Artificial data 

3.1.1. Solving fused Lasso 

We generated Gaussian data with n observations and p predictors, with each pair of predictors Xi, Xj(i ^= j) 
having the same population correlation p. The outcome values were generated by Y = ~YTj=i PjXj +e, where 
e is the Gaussian noise with mean and variance a. The regression coefficient (3 = (/3i, . . . ,/? p ) is a sparse 
vector with the values of (3j are generated according to 

!2, i = 1,2..., 20, 121, 122,..., 125 
3, i = 41 
1, i = 71, 72,... ,85 
0, else. 

The design of (3 is motivated by the fact that fused lasso are especially suitable for coefficients that are 
constant for an interval and change in jumps. An example plot of (3 of size p = 500 can be seen in Figure 
1(a). 

Table 1 shows the average CPU time for CVX, SQOPT and SBFLasso to solve the fused Lasso problems. 
SBFLasso consistently outperforms CVX in all cases we tested with a speedup of at least ten fold. Note 
that CVX fails to obtain results for large p problems due to out of memory errors. Although it has similar 
performance to SQOPT for small size problems (p ~ 200), SBFLasso is significantly faster than SQOPT 
for large p problems. For problems of n = 200, p = 20000, SBFLasso is able to obtain the optimal solutions 
within ~ 30 seconds, while it takes about 800 seconds for SQOPT to obtain the similar results. Overall, our 
algorithm is about twenty times faster than SQOPT for the large p problems. 

To evaluate how the performance of SBFLasso scales with problem size, we plotted the CPU time that 
SBFLasso took to solve the fused Lasso problem as a function of p and n. Figure 2 shows such a curve, 
where CPU time is averaged over 500 runs with different parameters Ai,A2 and different design matrix X. 
We note that the CPU times are roughly linear in both n and p. 

A key to the success of SBFLasso is that we split the regularization terms ||/3||i and ||£/3||i and make 
the minimization problems separable. Due to the soft thresholding in the Bregman iteration, the solutions 
obtained by SBFLasso arc naturally sparse as we can see from Figure 1(b). This is contrast to solutions 
obtained by CVX and SQOPT, because no thresholding steps are involved, solutions obtained by these 
two algorithms are not sparse, and sparseness can only be achieved through a thresholding step in the 
postprocessing. 



3.1.2. Solving FLSA 

Next we compare SBFLSA and the path algorithm (PATHFLSA) [9] for solving FLSA. PATHFLSA uses a 
fusion algorithm to solve FLSA, taking advantage of the special structure of the error term. It represents 
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Table 1 

Run times (CPU seconds) for fused lasso problems of various sizes p and n, different correlation p between the features. 
Methods are SQOPT, CVX and SBFLasso. The results are averaged over 100 runs (using 4 different predictor matrix X and 

25 different values of regularization parameters \i,\2 (or si,S2). 









n=100 


n=100 


n=200 


n=200 


n=200 


n=200 


p 




Method 


p=200 


p=1000 


p=2000 


p=5000 


p= 10000 


p=2000U 








time(sec) 


time(sec) 


time(sec) 


time(sec) 


time(sec) 


time(sec) 






CVX 


0.496 


2.646 


18.148 


64.453 






p = 





SQOPT 


0.0334 


0.510 


5.738 


39.269 


147.534 


> 600 






SBFLasso 


0.0366 


0.155 


1.488 


5.845 


12.724 


28.441 






CVX 


0.523 


2.792 


16.812 


61.914 






p = 


0.2 


SQOPT 


0.0323 


0.572 


6.812 


47.196 


205.365 


> 600 






SBFLasso 


0.0352 


0.323 


2.831 


9.716 


18.249 


34.061 






CVX 


0.518 


2.719 


16.504 


63.456 






p = 


0.4 


SQOPT 


0.0299 


0.611 


6.063 


48.010 


203.973 


> 600 






SBFLasso 


0.0338 


0.265 


2.803 


8.897 


24.680 


26.990 






CVX 


0.510 


2.856 


17.020 


62.920 






p = 


0.6 


SQOPT 


0.0312 


0.519 


6.508 


45.339 


197.794 


> 600 






SBFLasso 


0.0286 


0.143 


2.190 


8.947 


20.586 


36.157 






CVX 


0.511 


2.995 


19.379 


68.425 






p = 


0.8 


SQOPT 


0.0293 


0.527 


5.678 


41.147 


178.208 


> 600 






SBFLasso 


0.0190 


0.221 


1.426 


6.446 


15.505 


41.614 
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200 



300 



400 



500 



(a) 



3r 
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1 

0.5- 

0- 
-0.5- 
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SBFLasso 



100 



200 300 
P 

(b) 



400 



500 



Fig 1: (a) The figure of coefficient (3 in 500 dimension; (b) The biue line is the solution derived by SBFLasso with 
Ai = 16, A2 = 20 and the red line is the original (3 in 500 dimension. 
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Fig 2: CPU times for SBFLasso for the same problem as in Table 1, for different values of n and p. In each case the 
times are averaged over 500 runs, (a) n is fixed and equals to 200; (b) p is fixed and equals to 5000. 

the state of the art for solving the FLSA problem. We generate data according to y = j3 + e, where e is the 
Gaussian noise with mean and variance a, and (3 is a sparse vector which has similar shape as the one 
shown in Figure 1(a) with more nonzero entries. We vary p from 10 3 to 10 6 and the results for each p are 
averaged over 10 runs. 

Table 2 shows that the computational times of SBFLSA and PATHFLSA for solving the FLSA prob- 
lems. We note that the performance of SBFLSA is similar to PATHFLSA in almost all cases we tested, and 
both algorithms significantly outperforms SQOPT with thousands of times faster for large p problems. 

Table 2 

Run times (CPU seconds) for an 1- dimensional FLSA problems of various sizes p. Methods are SQOPT, SBFLSA and path 
algorithm for FLSA(PATHFLSA). The results are averaged over 10 runs. 



parameters 




Method 


p = 10 4 


p = 10 s 


p= 10° 






SQOPT 


106.97 


> 5 hours 




Aj = 0.1, A 2 


= 0.8 


SBFLSA 


0.053 


0.754 


8.681 






PATHFLSA 


0.050 


0.651 


8.685 






SQOPT 


107.74 


> 5 hours 




Aj = 0.2, A 2 


= 1.0 


SBFLSA 


0.053 


0.820 


8.263 






PATHFLSA 


0.051 


0.653 


8.678 






SQOPT 


108.12 


> 5 hours 




Aj = 0.3, A 2 


= 1.2 


SBFLSA 


0.053 


0.798 


9.289 






PATHFLSA 


0.049 


0.654 


8.657 






SQOPT 


106.13 


> 5 hours 




Aj = 0.4, A 2 


= 1.5 


SBFLSA 


0.053 


0.806 


9.892 






PATHFLSA 


0.049 


0.651 


8.661 



Although the performance of SBFLSA and PATHFLSA are similar for fixed Ai and A 2 , PATHFLSA has 
an additional advantage of generating solutions for a path of the regularization parameters. However, because 
PATHFLSA works by fusing variables, a necessary condition for it to work is that the solution path has to 
be piece- wise linear when varying X 2 . This condition is in general not true for both the fused Lasso and the 
generalized fused Lasso. As such, it cannot be applied to these cases. 

3.2. Mass spectrometry data 

Mass spectrometry (MS) holds great promise for biomarkcr identification, and genome wide metabolic and 
proteomic profiling. The protein mass spectroscopy application was used as a motivating example for fused 
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Lasso in the paper by Tibshirani et al. [1]. Next we illustrate the efficiency of SBFLasso for solving the 
fused Lasso problem for mass spectrometry data. The data we use is taken from [26]. It consists of MS 
measurements of 95 normal samples and 121 samples taken from patients with ovarian cancer. The raw data 
contains a total of 368, 750 mass-to-charge ratio (m/z) sites. 

We first preprocessed the data using the procedure described in [26], consisting of the following three 
steps: 1) re-sampling: Gaussian kernel reconstruction of the signal in order to have a set of d-dimensional 
vectors with equally spaced mass/charge values; 2) baseline correction: removes systematic artifacts, usually 
attributed to clusters of ionized matrix molecules hitting the detector during early portions of the experiment, 
or to detector overload; 3) normalization: corrects for differences in the total amount of protein desorbed 
and ionized from the sample plate. The average profiles from normal and cancer patients after preprocessing 
are shown in Figure 3. 

Table 3 

Run times (CPU seconds) for SBFLasso, SQOPT and CVX on MS data for different values of the regularization parameters 

Ai and A2 . 



parameters 




10-CV error 


SBFLasso 


SQOPT 


CVX 


Ai = 2.0, A 2 


= 3.5 


6/216 


2.9854 


31.707 


24.481 


Ai = 2.5, A 2 


= 4.5 


8/216 


3.6612 


30.310 


23.456 


Ai = 3.0, A 2 


= 1.0 


6/216 


3.2082 


35.911 


22.261 


Ai = 3.5, A 2 


= 2.5 


9/216 


3.5080 


32.094 


21.130 



- Control 

- cancer 



500 1000 1500 2000 2500 3000 
m/z 

(a) 



0.04 
0.03 
0.02 
0.01 


-0.01 
-0.02 
-0.03.; 



-cvx 

- SBFLASSO 



500 1000 1500 2000 2500 3000 
m/z 

(b) 



Fig 3: (a) Protein mass spectroscopy data:average profiles from normal (blue) and cancer patients (red); (b) Estimated 
/3 from protein mass spectroscopy data with Ai = 2, A2 = 3.5 by CVX (blue) and SBFLasso (red). 



We apply the fused Lasso (1) to the MS data to select features (m/z sites) that can be used to predict 
sample labels. For each sample, the response variable is either 1 or —1, and the predictor variable is a 
vector consisting of the intensity of p = 3000 sampled m/z sites. We used SBFLasso to solve the fused 
Lasso problem, and compared its performance to CVX and SQOPT. The results are summarized in Table 3, 
which shows the computational times spent by different solvers in a ten-fold cross-validation procedure for 
different parameters Ai and A2. SBFLasso is consistently many times faster than CVX and SQOPT, with 
an approximately ten-fold speedup in almost all cases. The coefficients derived by SBFLasso and the other 
solvers are very similar (Figure 3b), but SBFLasso is able to achieve a sparser solution, and in addition a 
slightly lower objective function than CVX. 
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3.3. Comparative genomic hybridization (CGH) data 



In tumor cells, mutations often cause a large DNA segment to be deleted or inserted in a chromosome, in a 
phenomena called copy number variation (CNV). Array CGH is a technique that is used to detect CNVs in 
a genome by labeling DNA from a test sample and normal reference sample differently using fluorophores 
and hybridizing to genomcwide probes. The log ratio of the fluorescence intensity of the test DNA to the 
reference DNA is then calculated for each probe. A value greater than zero indicates a possible gain in DNA 
copies of the region around the probe, while a value less than zero suggests a possible loss. [3] demonstrated 
the efficiency of the fused lasso signal approximator (FLSA) for detecting CVNs using array CGH data. 
Next, we will show that SBFLSA is an efficient tool for solving the FLSA problem for array CGH data. 

Table 4 

Run times (CPU seconds) for SBFLSA on CGH data for different values of the regularization parameters Ai and A 2 . 



parameters 




SBFLSA 


PATHFLSA 


SQOPT 


CVX 


Ai = 0.10, A 2 


= 3.0 


0.007 


0.006 


0.7273 


0.7124 


Ai = 0.12, A 2 


= 3.5 


0.007 


0.005 


0.6193 


0.6451 


Ai = 0.15, A 2 


= 3.0 


0.007 


0.006 


0.6161 


0.6604 


Ai = 0.18, A3 


= 3.2 


0.006 


0.005 


0.6370 


0.6244 




400 600 
genome order 

(a) 



800 1000 



200 



400 600 
genome order 

(b) 



800 1000 



Fig 4: Fused lasso applied to some GBM data. The data are shown in the left panel, and the solid red line in the 
right panel represents the inferred copy number /3 from SBFLSA. The gray line is for y = 0. 

We used the glioblastoma multiforme (GBM) data from [27], which contains array CGN profiling of 
samples from primary GBMs, a particular malignant type of brain tumor. Table 4 shows the CPU times 
spent by SBFLSA to solve the FLSA problem for different regularization parameters Ai and A2. We observe 
that our method is significantly faster than SQOPT and CVX, with a speed improvement of about 100 times. 
The performance of SBFLSA is also comparable to the path algorithm, which is specially designed, the state 
of the art for solving FLSA problems. Figure 4(b) plots the copy number variants detected by SBFLSA with 
Ai = 0.10 and A2 = 3.5, clearly showing the gain of DNA segments in two nearby chromosomal regions in 
GBM. 

4. Discussion 

Fused Lasso is an attractive framework for regression or classification problems with some natural ordering 
occurring in regression or classification coefficients. It exploits this naturally ordering by explicitly regu- 
larizing the differences between neighboring coefficients through an l\ norm regularizer. Solving the fused 
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Lasso problem is, however, challenging because of the nondifferentiability of the objective function and the 
nonseparability of the variables involved in the nondifferentiable terms of the objective function. Existing 
solvers circumvent these difficulties by reformulating the fused Lasso problem into a smooth constrained 
convex optimization problem by introducing a large number of auxiliary variables and constraints, and as 
such, are only able to solve small or medium size fused Lasso problems. 

We derived an iterative algorithm based on the split Brcgman method to solve a class of fused Lasso 
problems, including SBFLasso for the standard fused Lasso, SBFLSA for the fused Lasso signal approximator, 
and SBFLSVM for fused Lasso support vector classifier, and proved their convergence properties. Preliminary 
experimental results for SBFLasso and SBFLSA show their efficiency for large scale problems, especially for 
problems with large p, small n, which occur in many real-world applications. 

The iterative algorithm we propose is very easy to implement, involving only a few lines of code. It is 
also very general and can be adapted to solve a variety of fused Lasso problems with minor modifications as 
we have shown. In this aspect, it is very different from the path algorithm, which is specially designed for 
FLSA and requires significant amount of domain specific knowledge. Because of its simplicity and generality, 
we expect the split Bregman iterative algorithm would find its usage in a wide range of other l\ related 
rcgularization problems. 
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Appendix 



A.l. Convergence analysis of Algorithm 1 

We use some similar ideas from Cai et al. [15] to prove Theorem 1. Different from the case of Cai et al. [15], 
we do not require V(j3) to be diffcrentiable, and treat the nondiffcrcntiability of V((3) explicitly by using its 
subgradient vector h G dV((3), see the proof for details. 

Proof of Theorem 1: Since all the subproblems involved in (8) are convex, the first order optimality 
condition of the Algorithm 1 gives 

'0 = h k+1 + u k + L T v k + /ii(/3 fe+1 - a k ) + p 2 L T {Lp k+1 - b k ), 
X lP k+1 - u k + pi{a k+1 - P k+1 ) = 0, 



\ iq k+1 



p 2 (b 



k+i 



Lj3 k+1 ) = 0, 



(24) 



fc+i 



)■ 



= v k + 8 2 {Lp k+l - b k+1 ), 

where h k+1 G dV \P k+1 ) , p k+1 G <9||a fc+1 ||i and q k+1 G d\\b k+1 \\ 1 . 

Since /?* is a solution of (2), by the first order optimality condition, there exist h*,p*,q* such that 



h* + X lP * + X 2 L T q* = 



where h* G dV(/3),p* G d\\/3*\\i,q* G d\\b*\\i with b* 
Xip*,v* = X 2 q* , we can formulate (25) as 



= h* + u* + L T v* + - a*) + [i 2 L T (Lfi* - b*) 

Xip* -u* (a* -/?*) = 0, with p* G 9||a*||i, 
X iq * -v* + fi 2 (b* - L(3*) = 0, with 9*6 9116*11!, 
u* = u* + <5i(/3* — a*), 
u* = v* + 6 2 {L0* - b*). 



L{3*. Introducing new variables a* 
with h* G 0V(/?) 



(25) 



(26) 



Comparing (26) with (24), we can see that /3*, a*, b* , u* , v* is a fix point of Algorithm 1. Denote the errors 

by 

f3 k = f3 k - f3*, a k =a k - a*,b k = b k - b* ,u k = u k - u* and v k = v k - v* . 
Subtracting the first equation of (24) by the first equation of (26), we obtain 

= h k+1 - h* + u k + L T v k + /ii(/3 e fc+1 - a*) + [i 2 L T (L(3 k+1 - b k ). 

Taking the inner product of the left and right hand sides with respect to f3 k+1 , we have 





(h k+1 - h* , (3 k+1 - /?*) + (u k ,f3 k+1 ) + (v k ,Lf3 k+1 ) 
+^||^ 1 ||i-/ 11 (o^^ 1 >+/* a ||£^ 1 ||l-M 2 <6^^* +1 > 

Similarly, we can get 

X 1 (p k + 1 -p*,a k+1 -a*)+^ 1 \\a k+1 \\l- f i 1 (a k+ \p k+1 ) (u k e ,a k e +1 ) = 0, 

X 2 (q k+1 -q*,b k+1 -b*) + f i 2 \\b k+1 \\ 2 2-^ + \L/3 k + 1 } - (v k e> b k + 1 )=0. 
Summing (27), (28) and (29) together, we get 

(h k+1 - h*,p k+1 - /3*) + Ai(p fc+1 - p*,a k+1 - a*) + X 2 (q k+1 - q*,b k+1 - b*) 



(27) 

(28) 
(29) 



-HiiWr'Wl + \We +1 \\l - <# +1 ,a* + ^ +1 )) + ^ 2 (\\L(3 k+1 \\l + \\b k+1 \\ 



k+l 1 1 2 
e \\2 

fc+i tJ> + ( u k ,(3 k+1 - a k+1 ) + (v k ;,Lf3 k+1 

15 



b k + b K 



) = o 



(30) 



Furthermore, by subtracting the fourth equation of (24) by the one of (26), we have 

u k ^=u k + Si(^-a k ^). 
Taking square of both sides of the above equation implies 



Similarly, we have 



(«*, ^ - a^) = ^(Wu^Wl \\u k e \\l) - a^\\l 



(v k ,L^ - = 2^(H^ +1 |l2 ~ H\\D - ylli/e 1 - b k+1 \\l 



Substituting (31) and (32) into (30) yields 



^(ll^!l^-||^ +1 ll^) + ^(lk fc ll^-lk fe+1 ll^) 

(h k+1 - h*. : p k+1 - f3*) + Xi(p k+1 - V \ a k+1 - a*) + X 2 (q k+1 - q*,b k+1 

+M\\P k e + X + h k e +1 \\t - W k e +1 ,a k + a k ^) - ^g) 

+ Mwm^g + \\b k e +i \\i - (m + \b k + r 1 ) - ^wm +i - ^hd 



Note that for any x, y, z el p , we have 



|x||3 ± <x,y + z> + ||y||| = l\\x±y\\ 2 2 + i||x±z||l + i(||y||2 - ||i 



Using this elementary equation, (33) can be transformed to 



^(ll^!l^-!l^ +1 ll^) + ^(lk fc ll^-lk fe+1 ll^) 

(h k+1 - h\l3 k+1 - f3*) + Xi(p k+1 - p*,a k+1 - a*) + X 2 (q k+1 - q*,b k+1 

Ml Mlofc+l „fel|2 i ||_*+1||2 ||„fc||2 | Ml— jl II ofc+1 „fc+l||2 



, 11/r 1 - <u + nana - km + ^-^m +L 

l \ hi 
+f (ll^ e fe+1 -b k \\l+ \\b k+1 \\i \\b k \\l + ^l W k ^ 6£+i||») 

Summing the above equation from k = to k = K yields 

11 0| 2 _ ||„,Jf+l||2\ , J_ni,,0||2 _ ||„,X+1||2 



u e\\2 - II" 



2V euz " 6 2,5; 



H2) + ^i-(KII3-ll«f +1 ll3) 



2 



+f (Kll^ll«f +1 ll^) + f (IM - II6? +1 II2) 

^](/i fe+1 -ft*,/?^ 1 - (3*) + XiJ2(p k+1 -p\a k+1 -a*) 

k=0 k=0 

+A 2 J2{q k+1 q*,b k +l -b*) + ^Y. Wtt +1 ~ <*\\l + E 
k=Q k=0 M 1 k=0 



2 ^_,„ , e 

fe=0 ^ fc=0 
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The fact that h k+1 e dV(f3 k+1 ) 1 h* e dV((3*) and V(/3) is convex implies 

(h k+1 ~h*,f3 k+1 -13*) 
= V{p k+1 ) - V(f3*) - {h\p k+1 - (3*) + V(p*) - V{p k+1 ) - {h k+ \(3 k+1 - p*) 
> (36) 

by the definition of subgradient. Similarly, (p k+1 - p*,a k+1 - a*) > 0, (q k+1 - q*,b k+1 - b*) > 0. Together 
with the fact that < Si < fj,\ and < S 2 < /i 2 , ah terms involved in (35) are nonnegative. Therefore, 

J2(h k+1 h\[3 k ^ - p*) < ±-\\u° e \\l + ^-\\v° e \\l + ^\\a° e \\l + (fmt 



k=0 

which leads to 

Together with (36) leads to 

Similarly, we can prove 



lim (h k - h*,/3 k - /3*) = 0. 

k— ¥00 



lim V{f3 k ) - V{J3*) - (h*, (3 k - (3*) = 0. (37) 

A:— >og 

Ai lim j|a fc ||i - ||a*||i - (a k - a*,p*) = 0, (38) 
A 2 lim ||t>*||i-||6*||i-(&*-6*,O = 0, (39) 

k— >oo 

lim \\(3 k - a k \\ 2 = and lim \\L(3 k - b k \\ 2 = 0. (40) 

k— >oo k— >oo 



Since || ■ ||i is continuous, by (38), (39) and (40), we obtain 

Qk II || o* || / ok 



X 1 lim II^IK - ||i - (f3 k - p*,p*) = 0, (41) 

k— >oo 

A 2 lim \\L(3 k \\ 1 -\\Lp*\\ 1 -(L(3 k -Lp*,q*)=0. (42) 

k— voo 



Summing (37), (41) and (42) yields 



lim V{l3 k ) + Xi\\/3% + \ 2 \\LJ3% - (V(/3*) + Ai||/3*||i + A 2 ||L/9*lb 

k— >oo 

ik o* l * i i tT , 



+ k - P*,h* +p* + L'q*) = 
This together with (25) proves 

lim V((3 k ) + \ 1 \\[3 k \\ 1 + \ 2 \\L[3 k \\ 1 =V(f3*) + \ 1 \\(3*\\ 1 +\ 2 \\L[3*\\ 1 . (43) 

k— >oo 

Next, we prove that 

lim ||/3 fc -/r|| 2 = (44) 

k— >oo 

whenever (2) has a unique solution. 

It is proved by contradiction. Let $(/9) = V((3) + Ai||/3||i + A 2 ||£/3||i. Then $(/9) is a convex, lower 
continuous function. Since /?* is the unique minimizer, we have $(/3) > $(/?*) for all (3 ^ (3* . If (44) does 
not hold, there exists a subsequence /3 fei such that \\j3 ki — /9*|| > e for some e > and for all i. Then 
<i>(/? fei ) > min{$(/3) : - f3*\\ 2 = e}. Indeed, let 7 be the intersection of the sphere {f3 : \\/3 - f3*\\ 2 = e} and 
line segment from (3* to (3 ki , then there exists a positive number £ £ (0, 1) such that 7 = tf3* + (1 — t)(3 ki . 
By the convexity of $ and the definition of (3* , we have 

<5>(l3 k ') > t<l>(l3*) + (l-t)<S>((3 k *)><l>(tl3* + (l-t)p k >) 
= $( 7 )>mm{$(/3):||/3-/r|| 2 =e}. 

Denote f3 = argmin{$(/3) : \\(3 — f3*\\ 2 = e} . By applying (43), we have 

$09*)= lim $(^ fe *) > $(/3) > $(/3*), 

i— J- 00 00 

which is a contradiction. □ 
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A. 2. Convergence analysis of Algorithm 2 

Now we give the proof of Theorem 2. The main idea is the same as the one in [15]. However, due to the extra 
bias term /3q and the hinge loss, the terms involved in our proof are more complicated. 

Since all the subproblems involved in (18) are convex, the first order optimality condition gives 

f f^I + ^ 2 L T L + ^X T Y 2 X n 3 X T Yy\ 



^y T YX 



fj-3y T y 



fc+i 



Mi 





\ lP k+1 - u k + ^i{a k+l - f3 k+1 ) = 0, 
X 2 q k+1 - v k + n- 2 (b k+1 - Lf3 k+1 ) = 0, 

M3 (c fe+1 + YXp k+1 + f3 k+1 y - 1) = 0, 
= u k + S 1 (/3 k+1 -a k+l ), 
■ k+1 = v k + S 2 (Lp k+1 -& fc+1 ), 



2_ s k+l _ w k 

n 

n,k+l — ll k 



„k+l 



+ 1 = w k + <y 3 (l - YX(3 k+1 - p k+L y - c fe+1 ), 



(l-c k + Hz L w k ) 



(45) 



where p k+1 e d\\a k+1 \\ 1 ,q k+1 E 9||fe fc+1 ||i and s k+1 e d\ 



Jc+1\ 



This simple observation will be used in our 



proof for the convergence of SBFLSVM. 

Proof of Theorem 2. Let (/3* , /3q ) be an arbitrary minimizer of (16). By the first order optimality 
condition, there exist p*, q* and s* such that 



-±X T Ys* + \ lP * + \ 2 L T q* = 0, 

y T s* = o, 



(46) 



where s* € <9(c*)+ with c* 
Introducing new variables a* 



l- yi (xjp* +p%),i = l,...,n,p* e q* e d\\b*\\i with b* = Lp*. 

- p*,u* = Xip*,v* = X 2 q* and w* — -^s*, we can formulate (46) as 



Vi/ + ^L T L + fi 3 X T Y 2 X [i 3 X T Yy 

^y T rx 



^y T y 





o 




(1-C*+^V) 



Xip* -u* +/ii(a* -/?*) = 0, with p*ed||a*||i 
A 2 ?*-«* + A( 2 (6*-£j8*)=0 J with g*ea||6*|| a 
±s*-w* + fi 3 (c*+YXp* + PZy-l)=0, with s* G 0(££ =1 (c i ) + ) 
u* = u* + 8i(P* - a*), 
u* = v* +6 2 (L(3* -b*), 
w* = w* + 8 3 (1 - YX/3* - fifty - c*). 

Therefore, /3*, /3q, a*, 6*, c*, u*, w*, u>* is a fixed point of (45). Denote the errors by 

Pi = p k - P\p k e = P k - P*,a k = a k - a\b k = b k - b* , 



c e " = c " ~ c ' ,u k = u'" — u ' , Ug = tr — u" and u; e " = w" — w' 
Subtracting the first equation of (45) by the first equation of (47), we obtain 

^I + fi 2 L T L + fi 3 X T Y 2 X n z X T Yy\(p k+1 



* k 

u ,v e 



v k — v* and w k 



Mi 



Msy J YX 



Ml l U k e 



^y T y 



Po 



k+1 



M2 







(&e~M2 V)+M3 
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(-c* + m 3 <) 



(47) 



Taking the inner product of the left and right hand sides with respect to {{fi k+1 ) T , ftot 1 ) T i we have 
(( Ml 7 + m L T L + ^X T Y 2 X)(3 k+1 + ^X T Yy$t\Pe +1 ) + ^{YXp k+1 + y/3&\ y^ 1 ) 



= ^(a k e , (3 k+1 ) - («*, (3 k+1 ) + ^ 2 (b k ,L(3 k+1 ) (v k , L? k+1 ) 



V 3 {c k , YXp k+1 ) + (w k , YXp k+1 ) - M3<c*, y/3 fc e +1 ) + ^tvttt 1 )- (48) 



The same manipulations applied to the second (third, fourth) of equation (45) and the second (third, fourth) 
of equation (47) lead to 

Ai (p k+1 - V* , a k+1 -a*)+ to \\a k+1 \\j - (u k , a k+1 ) - Ml (f3 k+1 , a k+1 ) = 0. (49) 

h(q k+1 -q\b k+l -b*)+fx 2 \\b k e +1 \\l (v k ,b k+1 )-n 2 (Lf3 k+ \b k+1 )=0. (50) 

l(,s fc+1 - s*,c k+1 c*)+^\\c k+1 \\l - (w k ,c k+1 ) +^(YXf5 k+1 + f3 k + 1 y,c k+1 ) = 0. (51) 

By summing equations (48), (49), (50) and (51), we get 

Ai(/ +1 -P\a k+1 -a*) + X 2 (q k+1 -q*,b k+1 -b*) + -(s k+1 -s*,c k+1 - c*> 

n 

(P k+ \a k+1 +a k ) + ||o* +1 ||2) 
+/i 2 (||£/3Mi - ^ + b k ,Lf3 k+1 ) + Wb^Wl) 

+ f , 3 (\\YXp k+1 + \\t + ^ 3 (YXf3 k+1 + yP k +\c k+1 + c k ) + \\c k+1 f 2 ) 

+ {u k ,p k + 1 - a k+1 ) + (v k , Lf3 k+1 b k+l ) {w k ,YXf3 k+1 + y^ 1 + c k+1 ) = 0. 



Furthermore, subtracting the fifth equation of (45) by the fifth equation of (47), we have 

„fc+i _ „* . x (ak+l n k+l\ 

which leads to 



Similarly, we can get 



and 



Substituting (53), (54) and (55) into (52) yields 

^(ll^lll - + ^(II^III - lk ft+1 llf) + ^ 



Ai(/ +1 -p*,a k+1 - a*) + X 2 (q k+1 - q*,b k+1 - b*) + -(s k+1 - s*,c k+1 - c*> 

n 

k+l\\2 lok+1 „k+l , „k\ , ||„fc+l||2 $1 || pk+1 h+1 \\2\ , ,. / II t ok+1 i|2 



-m+\b k +^+bi) + \\b k ^wi ^-jm+i -6^ni) +^{\\Yxp k ^ +y ^\\i 

+ (YXf3 k + 1 +yf3 k +\c k+1 +c k ) + \\c k+1 \\l - A-JYX (3 k+1 + (3 k ^y + c k+1 \\l 
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(52) 



(^,L/3 e fc+1 - e 1 ) = ^(IK'e +1 H^ - ~ y - ^ +1 H1 (54) 



—(W^Wl- + ^wyx^ + ^v + c^WI (55) 



(56) 



Using the elementary equality (34), (56) can be transformed to 

L ~ \\u k e +1 \\ 2 2 ) + ^{\\v k e \\l - \\v k+1 \\j) + ^-(H\\ 2 2 



It' 



2tfi Ml elU n e ,u, 2tf 2 6 25 3 

Ai(/ +1 -p*,a k+1 -a*)+X 2 (q k+1 - q\b k+1 -b*) + -(s k+1 -s*,c k+1 -c*) 

n 

+f ( Htf 4 " 1 - + \\a k e +1 \\l + ^-^H/T 1 - aPX 



Mi 



2 

M3 



Summing the above equation from k = to k = K yields 

^ihlWl \We +1 \\D + 2^(K°ll2 - ||«f +1 ||S) + ^(KHa - 

+f (ii«2iii - \\^ +l \\D + ^mwi - iiif +l \\D + f (iicgm - iicf +i ii^) 

if if 



(57) 



' 2 e enz 2 

fc=0 fc=0 fc=0 

-f n^e +1 - 6£ii2 + E n^r 1 - e 1 ill 

fc=0 fe=0 



£ ll^/3 P fe+1 + V& 1 +C k e \\l+ E W YX ^ +1 + Poe'v + C " 

fe=0 fc=0 



(58) 

The fact p k+1 e <9||a fe+1 ||i,p* e 9||a*|| and || • ||i is convex implies (p k+1 - p*,a k+1 - a*} > 0. Similarly, 
(q k+1 - q*,b k+1 - b*) > 0, (s k+1 - s*,c k+1 - c*) > 0. Therefore, all terms involved in (58) arc nonnegative. 
Now we can cheat each term in the right hand side of (58) separately by the same argument as the proof of 
Theorem 1 and get the convergence result (21). The proof of (22) can also follows the same line as the one 
of Theorem 1, we omit the details here. 

A. 3. Updates in SBFLSVM 

A. 3.1. Update of 

Due to the extra bias term of /3g +1 , we need to solve the following linear system which is slightly different 
from (14). 

^I + ^L T L + ^X T Y*X ^X T Yy\(P k+1 
^y T YX my T y ) {(3 k+1 

Mi " ^ V ) + Ms ( L T ) (b k M2 - V) + A , 3 (*7) (1 - c k + ^w k ) 

(59) 

20 



Fortunately, this linear system can also be solved by PCG efficiently Note that 



(Hil + n 2 L T L + ij 3 X t Y 2 X nsX T Yy 

V fj-3y T YX vzy T y 

^I + fi 2 L T L \ fX T Y\ (vy fO 

Ma» T »J +A "U T J (yX ' y) -V0 



It is easy to see that ( j* 2 ^ *~* | is still a tridiagonal matrix and 

V tov vj 

is a low rank matrix. So PCG is still a good solver for the linear system (59). 



A. 3.2. Proof of Proposition 1 

Proof. The energy function Ax+ + ^||x — io|[| is strongly convex, hence has a unique minimizer. Therefore, 
by the subdifferential calculus (c.f. [28]), s\ is the unique solution of the following equation with unknown w 

e Xd(x+) + x- w, (60) 

where d(x + ) = {p G K : y + — x + — (y — x)p > 0,Vy G R} is the subdifferential of the function x+. If x ^ 0, 
then x+ is differentiable, and its subdifferential contains only its gradient. If x = 0, then <9(x+) = {p 6 K : 
?7+ — > 0,Vy G R}. One can check that d(x+) — {p : < p < 1} for this case. Indeed, for any p £ [0, 1], 
yp < 2/+ by using the definition of y+. On the other hand, if there exists a number p £ (-co, 0) U (1, +oo) and 
p £ d(x+), then we can easily get a contraction. Actually, the fact p £ (— oo, 0) U (1, +oo) implies p 2 > p + . 
On the other hand, since d(x+) = {p £ R : y+ — yp > 0, £ R} for x = 0, we have p 2 < p+ by letting 
y = p. In summary, 

C 1, x>0, 
£>(*+)=< {p:p€ [0,1]}, x = 0, (61) 
[ 0, x < 0. 

With (60) and (61), we can get the desired result. □ 
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